ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection

نویسندگان

  • Cristian Grozea
  • Christian Gehl
  • Marius Popescu
چکیده

In this paper we describe a new general plagiarism detection method, that we used in our winning entry to the 1st International Competition on Plagiarism Detection, the external plagiarism detection task, which assumes the source documents are available. In the first phase of our method, a matrix of kernel values is computed, which gives a similarity value based on n-grams between each source and each suspicious document. In the second phase, each promising pair is further investigated, in order to extract the precise positions and lengths of the subtexts that have been copied and maybe obfuscated – using encoplot, a novel linear time pairwise sequence matching technique. We solved the significant computational challenges arising from having to compare millions of document pairs by using a library developed by our group mainly for use in network security tools. The performance achieved is comparing more than 49 million pairs of documents in 12 hours on a single computer. The results in the challenge were very good, we outperformed all other methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Encoplot - Tuned for High Recall (also Proposing a New Plagiarism Detection Score)

This article describes the latest changes to our plagiarism detection system Encoplot. We have sent the modified system to the PAN@CLEF 2012 automatic detection of plagiarism challenge, where it ranked 2nd by the F-measure and 3rd by the “plagdet“ scoring method that we had previously shown to be flawed to some extent. The main changes have been done to the heuristic that tries to recognize the...

متن کامل

Encoplot - Performance in the Second International Plagiarism Detection Challenge - Lab Report for PAN at CLEF 2010

Our submission this year is generated by the same method Encoplot that we have developed for the last year competition. There is a single improvement, we compare in addition each suspicious document with each other and flag the passages most probably in correspondence as intrinsic plagiarism.

متن کامل

The Encoplot Similarity Measure for Automatic Detection of Plagiarism - Notebook for PAN at CLEF 2011

This paper describes the evolution of our method Encoplot for automatic plagiarism detection and the results of the participation to the PAN’11 competition. The main novelties are the introduction of a new similarity measure and of a new ranking method, which cooperate to rank much better the source– suspicious document pairs when selecting the candidates for the detailed analysis phase. We hav...

متن کامل

Who's the Thief? Automatic Detection of the Direction of Plagiarism

Determining the direction of plagiarism (who plagiarized whom in a given pair of documents) is one of the most interesting problems in the field of automatic plagiarism detection. We present here an approach using an extension of the method Encoplot, which won the 1st international competition on plagiarism detection in 2009. We have tested it on a large-scale corpus of artificial plagiarism, w...

متن کامل

A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

The task of plagiarism detection entails two main steps, suspicious candidate retrieval and pairwise document similarity analysis also called detailed analysis. In this paper we focus on the second subtask. We will report our monolingual plagiarism detection system which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passag...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009